10 research outputs found

    The preprocessed doacross loop

    Dependencies between loop iterations cannot always be characterized at compile time. Doacross loops typically rely on a priori knowledge of inter-iteration dependencies to carry out the required synchronizations. A type of doacross loop is proposed that allows the iterations of a loop to be scheduled among processors without advance knowledge of inter-iteration dependencies. The proposed method requires that parallelizable preprocessing and postprocessing steps be carried out during program execution.
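
    The paper's exact scheme is not reproduced here; the following is a minimal sketch of a doacross loop with run-time preprocessing, assuming the cross-iteration dependencies come from an indirection array idx[] and using OpenMP for the parallel phase (all names are illustrative). A preprocessing pass records, for each iteration, the most recent earlier iteration that touches the same element, and the execution phase synchronizes through per-iteration completion flags.

```c
#include <stdlib.h>

void doacross(int n, int m, const int *idx, double *a)   /* m = length of a[] */
{
    int *dep  = malloc(n * sizeof *dep);   /* predecessor of each iteration       */
    int *done = calloc(n, sizeof *done);   /* per-iteration completion flags      */
    int *last = calloc(m, sizeof *last);   /* last iteration touching a[j], 1-based */

    /* Preprocessing (sequential here for clarity): for each iteration i,
     * record the most recent earlier iteration that touches a[idx[i]]. */
    for (int i = 0; i < n; i++) {
        dep[i]       = last[idx[i]] - 1;   /* -1: no earlier conflicting iteration */
        last[idx[i]] = i + 1;
    }

    /* Parallel execution; cyclic (static,1) distribution so that every
     * predecessor is always making progress on some thread. */
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < n; i++) {
        if (dep[i] >= 0) {                 /* wait for the predecessor to finish   */
            int ready = 0;
            do {
                #pragma omp atomic read
                ready = done[dep[i]];
            } while (!ready);
            #pragma omp flush              /* see the predecessor's update of a[]  */
        }
        a[idx[i]] += 1.0;                  /* stand-in for the original loop body  */
        #pragma omp flush                  /* publish the update before the flag   */
        #pragma omp atomic write
        done[i] = 1;
    }

    free(dep); free(done); free(last);
}
```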

    Run-time parallelization and scheduling of loops

    The class of problems that can be effectively compiled by parallelizing compilers is discussed. This is accomplished with the doconsider construct, which would allow these compilers to parallelize many problems in which substantial loop-level parallelism is available but cannot be detected by standard compile-time analysis. We describe and experimentally analyze mechanisms used to parallelize the work required for these types of loops. In each of these methods, a new loop structure is produced by modifying the loop to be parallelized. We also present the rules by which these loop transformations may be automated so that they can be included in language compilers. The main application area of the research is problems in scientific and engineering computation. The workload used in our experiments includes a mixture of real problems as well as synthetically generated inputs. From our extensive tests on the Encore Multimax/320, we conclude that for the types of workloads we have investigated, self-execution almost always performs better than pre-scheduling. Further, the performance improvement that accrues from global topological sorting of indices, as opposed to the less expensive local sorting, is not very significant in the case of self-execution.
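
    A minimal sketch of the kind of topological (wavefront) numbering of loop indices the abstract refers to is shown below. The names (ia, ib, wf) and the simple access model, in which iteration i reads a[ia[i]] and writes a[ib[i]], are illustrative assumptions rather than the paper's formulation, and the pass conservatively serializes every pair of iterations that touch the same element. Iterations with equal wavefront numbers can run concurrently; sorting indices by wavefront yields the reordered schedule that either pre-scheduling or self-execution then dispatches.

```c
#include <stdlib.h>

/* wf[i] receives the wavefront (topological level) of iteration i;
 * m is the number of elements of a[]; the return value is the number
 * of wavefronts. */
int compute_wavefronts(int n, int m, const int *ia, const int *ib, int *wf)
{
    int *lastwf = calloc(m, sizeof *lastwf);   /* deepest level touching a[j] */
    int  maxwf  = 0;

    for (int i = 0; i < n; i++) {
        int w = lastwf[ia[i]];                 /* must follow earlier accesses */
        if (lastwf[ib[i]] > w) w = lastwf[ib[i]];
        wf[i] = w + 1;
        lastwf[ia[i]] = lastwf[ib[i]] = wf[i];
        if (wf[i] > maxwf) maxwf = wf[i];
    }
    free(lastwf);
    return maxwf;
}
```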

    Run-time parallelization and scheduling of loops

    Run-time methods are studied for automatically parallelizing and scheduling the iterations of a do loop in cases where compile-time information is inadequate. The methods presented involve execution-time preprocessing of the loop. At compile time, these methods set up the framework for performing a loop dependency analysis. At run time, wavefronts of concurrently executable loop iterations are identified, and this wavefront information is used to reorder loop iterations for increased parallelism. Symbolic transformation rules are used to produce inspector procedures, which perform the execution-time preprocessing, and executors, which are transformed versions of the source loop structures that carry out the calculations planned by the inspectors. Performance results are presented from experiments conducted on the Encore Multimax. These results illustrate that run-time reordering of loop indices can have a significant impact on performance. Furthermore, the overheads associated with this type of reordering are amortized when the loop is executed several times with the same dependency structure.
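
    Continuing the sketch above (same assumptions: wavefront numbers produced by an inspector, a body(i) routine standing in for the loop body, and an OpenMP realization rather than the paper's), an executor runs the reordered iterations one wavefront at a time. Because the inspector's output depends only on the dependency structure, it can be reused whenever the loop is executed again with the same structure, which is how its overhead is amortized.

```c
/* order[] holds iteration indices sorted by wavefront number;
 * wf_start[w] .. wf_start[w+1]-1 delimit wavefront w;
 * body(i) performs the original loop body for iteration i. */
void executor(int nwf, const int *order, const int *wf_start,
              void (*body)(int))
{
    for (int w = 0; w < nwf; w++) {
        /* Iterations within one wavefront are independent, so the inner
         * loop is fully parallel; the implicit barrier at the end of the
         * omp for separates consecutive wavefronts. */
        #pragma omp parallel for schedule(dynamic)
        for (int k = wf_start[w]; k < wf_start[w + 1]; k++)
            body(order[k]);
    }
}
```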

    Run-time scheduling and execution of loops on message passing machines

    Sparse system solvers and general-purpose codes for solving partial differential equations are examples of the many types of problems whose irregularity can result in poor performance on distributed memory machines. Often, the data structures used in these problems are very flexible; crucial details concerning loop dependences are encoded in these structures rather than being explicitly represented in the program. Good methods for parallelizing and partitioning these types of problems require assigning computations in rather arbitrary ways, and naive implementations of programs requiring such general loop partitions can be extremely inefficient on distributed memory machines. Instead, the scheduling mechanism needs to capture the data reference patterns of the loops in order to partition the problem. First, the indices assigned to each processor must be locally numbered. Next, it is necessary to precompute what information is needed by each processor at various points in the computation. The precomputed information is then used to generate an execution template designed to carry out the computation, communication, and partitioning of data in an optimized manner. The design is presented for a general preprocessor and schedule executor whose structures do not vary, even though the details of the computation and of the type of information are problem dependent.
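
    A minimal sketch of the kind of preprocessing the abstract describes, local numbering of indices plus precomputation of what each processor must fetch, appears below. The block ownership range [lo, hi), the structure and function names, and the deferral of the actual message exchange (for instance to an MPI_Alltoallv step) are assumptions for illustration, not the paper's preprocessor.

```c
#include <stdlib.h>

typedef struct {
    int  nghost;        /* number of off-processor values this processor needs */
    int *ghost_global;  /* their global indices, in ghost-region order          */
} schedule_t;

void localize(int nref, const int *global_ref,   /* global indices the loop touches */
              int lo, int hi,                    /* my owned index range [lo, hi)   */
              int *local_ref,                    /* out: local buffer positions     */
              schedule_t *sched)
{
    int nlocal = hi - lo;
    sched->nghost       = 0;
    sched->ghost_global = malloc(nref * sizeof *sched->ghost_global);

    for (int k = 0; k < nref; k++) {
        int g = global_ref[k];
        if (g >= lo && g < hi) {
            local_ref[k] = g - lo;                       /* locally owned element */
        } else {                                         /* off-processor element */
            local_ref[k] = nlocal + sched->nghost;       /* ghost-region slot     */
            sched->ghost_global[sched->nghost++] = g;
        }
    }
    /* A real schedule would also deduplicate the ghost indices and group
     * them by owning processor before posting the communication. */
}
```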

    Krylov methods preconditioned with incompletely factored matrices on the CM-2

    The performance of the components of the key iterative kernel of a preconditioned Krylov-space iterative linear system solver is measured. In some sense, these numbers can be regarded as best-case timings for these kernels. Sweeps over meshes, sparse triangular solves, and inner products were timed on a large 3-D model problem over a cube-shaped domain discretized with a seven-point template. The performance of the CM-2 is highly dependent on the use of very specialized programs, which mapped a regular problem domain onto the processor topology in a careful manner and used the optimized local NEWS communications network. The rather dramatic deterioration in performance when these ideal conditions no longer apply is documented. A synthetic workload generator was developed to produce and solve a parameterized family of increasingly irregular problems.
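
    The three timed components can be pictured with the following minimal sketch: a sweep over a regular mesh with a seven-point template, a sparse lower triangular solve (the building block of applying an incomplete factorization), and an inner product. The CSR layout, the function names, and the sequential C realization are illustrative assumptions; the measured codes were specialized data-parallel CM-2 programs.

```c
/* Sweep over the interior of an nx*ny*nz mesh with a seven-point template. */
void stencil_sweep(int nx, int ny, int nz, const double *x, double *y)
{
    #define ID(i, j, k) ((i) + nx * ((j) + ny * (k)))
    for (int k = 1; k < nz - 1; k++)
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++)
                y[ID(i, j, k)] = 6.0 * x[ID(i, j, k)]
                               - x[ID(i - 1, j, k)] - x[ID(i + 1, j, k)]
                               - x[ID(i, j - 1, k)] - x[ID(i, j + 1, k)]
                               - x[ID(i, j, k - 1)] - x[ID(i, j, k + 1)];
    #undef ID
}

/* Sparse lower triangular solve L y = b, unit diagonal, strict lower part
 * of L stored in CSR form (so col[p] < i on every row i). */
void lsolve_csr(int n, const int *rowptr, const int *col,
                const double *val, const double *b, double *y)
{
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int p = rowptr[i]; p < rowptr[i + 1]; p++)
            s -= val[p] * y[col[p]];
        y[i] = s;
    }
}

/* Inner product. */
double dot(int n, const double *x, const double *y)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}
```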

    Improving the Performance of DSM Systems via Compiler Involvement

    Distributed shared memory (DSM) systems provide an illusion of shared memory on distributed memory systems such as workstation networks and some parallel computers such as the Cray T3D and Convex SPP-1. This illusion is provided by enhancements to hardware, software, or a combination of the two. On these systems, users can write programs in a shared-memory style instead of message passing, which is tedious and error-prone. Our experience with one such system, TreadMarks, has shown that a large class of applications does not perform well on these systems. TreadMarks is a software distributed shared memory system designed by Rice University researchers to run on networks of workstations and massively parallel computers. Due to the distributed nature of the memory system, shared memory synchronization primitives such as locks and barriers often cause significant amounts of communication. We have provided a set of powerful primitives that will alleviate the problems with...
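
    The abstract is truncated, but the problem it describes can be pictured with the minimal sketch below, which uses hypothetical DSM primitives (dsm_lock_acquire, dsm_lock_release, dsm_barrier) rather than the actual TreadMarks API. In a page-based DSM, the lock transfer and the barrier each force the runtime to communicate in order to make the shared update visible and to invalidate stale copies, and this synchronization-driven communication is the kind of overhead the compiler involvement in the title targets.

```c
/* Hypothetical DSM synchronization primitives, declared only to make the
 * shared-memory programming style concrete; not the TreadMarks interface. */
extern void dsm_lock_acquire(int id);
extern void dsm_lock_release(int id);
extern void dsm_barrier(int id);

void sum_reduction(int myid, int nprocs, int n,
                   const double *a, double *sum /* shared variable */)
{
    double local = 0.0;
    for (int i = myid; i < n; i += nprocs)   /* cyclic partition of the loop   */
        local += a[i];

    dsm_lock_acquire(0);                     /* lock transfer: communication   */
    *sum += local;
    dsm_lock_release(0);

    dsm_barrier(0);                          /* barrier: more communication to
                                                make all updates visible       */
}
```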